Solar energy forecast

Using Julia to predict the solar energy production

Author

Nicoló Foppa Pedretti

Published

June 27, 2025

1 Import libraries

Import libraries for data cleaning, statistics, machine learning and visualization

using Dates, DataFrames, Plots, StatsPlots, Statistics, Distributions
using StatsBase, HypothesisTests, LinearAlgebra, Random, MLJ, CSV, CategoricalArrays 
Precompiling StatsPlots


Clustering

MultivariateStats

StatsPlots

  3 dependencies successfully precompiled in 10 seconds. 258 already precompiled.


Precompiling HypothesisTests

HypothesisTests

  1 dependency successfully precompiled in 4 seconds. 86 already precompiled.

2 Dataset

Our data contains information on these key factors:

  • timedate: date and hour of each datapoint
  • WindSpeed: wind speed in Km/h
  • Sunshine: minutes per hours that sun is not cover by clouds (scaled 0-60)
  • AirPressure: athmosphere pressure in hPa
  • Radiation: solar radiation W/m2
  • AirTemperature: air temperature in Celsius
  • RelativeAirHumidity: relative humidity (scaled 0-100)
  • SystemProduction: system production kWh
DT = CSV.read("Solar_Power_Plant_Data.csv", DataFrame);
show(describe(DT),allcols = true)
8×7 DataFrame

 Row  variable             mean     min               median  max               nmissing  eltype    Symbol               Union…   Any               Union…  Any               Int64     DataType 

─────┼──────────────────────────────────────────────────────────────────────────────────────────────

   1 │ Date-Hour(NMT)                01.01.2017-00:00          31.12.2017-23:00         0  String31

   2 │ WindSpeed            2.63982  0.0               2.3     10.9                     0  Float64

   3 │ Sunshine             11.1805  0                 0.0     60                       0  Int64

   4 │ AirPressure          1010.36  965.9             1011.0  1047.3                   0  Float64

   5 │ Radiation            97.5385  -9.3              -1.4    899.7                    0  Float64

   6 │ AirTemperature       6.97889  -12.4             6.4     27.1                     0  Float64

   7 │ RelativeAirHumidity  76.7194  13                82.0    100                      0  Int64

   8 │ SystemProduction     684.746  0.0               0.0     7701.0                   0  Float64

In the table above we don’t have any real missing data but we can see some problematic values (like the negative values for solar radiation). We assume that each negative value of solar radiation is due to some data trasmission error so we set all the negative value equal to zero, in addition we parse the data related to the hour and date to timedate format

DT[DT.Radiation .< 0,:Radiation] .= 0.0;

DT.day = parse.(Int64,chop.(DT[:,"Date-Hour(NMT)"], head = 0, tail = 14))
DT.month = parse.(Int64,chop.(DT[:,"Date-Hour(NMT)"], head = 3, tail = 11))
DT.hour = parse.(Int64,chop.(DT[:,"Date-Hour(NMT)"], head = 11, tail = 3));

rename!(DT,"Date-Hour(NMT)" => "timedate");

# Create column with right formatting
DT.timedate_real = DateTime.(2017,DT.month,DT.day,DT.hour);
DT.date_real = Date.(2017,DT.month,DT.day);

Let’s visualize the SystemProduction for each hour:

plot(DT.timedate_real, DT.SystemProduction, 
    title="Hourly production", label= :none, size=(750,300))

From the plot we notice that we have some multiple consecutive days where the production is zero (for example in Jan, May, Dec). It seems that instead of missing value we have some zero value when we don’t have available data. We need to be very careful while performing the data cleaning because during the night the actual production of the solar panel is zero. Let’s group by day and check the days with zero production

df = groupby(DT, :date_real)
dt = combine(df, 
             ["SystemProduction","WindSpeed",
             "Sunshine","AirPressure",
              "Radiation","AirTemperature",
              "RelativeAirHumidity","month"] .=> [sum, mean, mean, 
                                                  mean, mean, mean, 
                                                  mean, mean]; 
    renamecols = true);
sort!(dt,:date_real);

p1 = scatter(dt.Radiation_mean, dt.SystemProduction_sum, title = "Production vs Radiation (day)", 
             label= :none)
p1 = vline!([40], label= :none)
p1 = hline!([1800], label= :none)
p2 = scatter(DT.Radiation, DT.SystemProduction, title = "Production vs Radiation (hour)", 
             label= :none)

plot(p1, p2, layout=(1,2), size=(750,300))

From the plot is clear that we have some outlier where radiation is grater than 40 W/m2 and production is lower than 1800 kWh per day. Let’s mark those days as suspicious and check if we have hour that can have some suspiciuos data, then compute the correlation between variables in the clean dataset

suspect_day = dt[(dt.Radiation_mean .> 40) .&& (dt.SystemProduction_sum .< 1800),:date_real]
filter!([:date_real, :SystemProduction, :Radiation] => (x,y,z) -> x  Ref(suspect_day) && 
        !(y == 0 && z > 40), DT)
cor(Matrix(DT[:,2:8]))
7×7 Matrix{Float64}:
  1.0         0.133932   -0.0432693  …   0.198341   -0.337062    0.205243
  0.133932    1.0         0.0254894      0.392487   -0.605845    0.656582
 -0.0432693   0.0254894   1.0           -0.0323885  -0.0825695   0.0125299
  0.189951    0.798208    0.0211536      0.535135   -0.616523    0.861283
  0.198341    0.392487   -0.0323885      1.0        -0.384636    0.503138
 -0.337062   -0.605845   -0.0825695  …  -0.384636    1.0        -0.593409
  0.205243    0.656582    0.0125299      0.503138   -0.593409    1.0

Last step before the machine learning model is to include the time into a numerical variable using Cyclical Encoder. This methods allow us to take into account the time cyclicity for months, days, hours:

\[s(t) = \sum_{n=1}^{N} \left( a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right) \right)\]

With the following code:

function cyclical_encoder(df::DataFrame, columns::Union{Array, Symbol}, max_val::Union{Array, Int} )
    for (column, max) in zip(columns, max_val)        
        df[:, Symbol(string(column) * "_sin")] = sin.(2*pi*df[:, column]/max)
        df[:, Symbol(string(column) * "_cos")] = cos.(2*pi*df[:, column]/max)
    end
    return df
end

cyclical_encoder(DT, ["day","month","hour"], [31,12,23]);

3 Machine Learning

EvoTrees is a regression algorithm in Julia library for creating gradient boosting regression models. It allows you to build decision trees efficiently, focusing on performance. EvoTrees works by combining multiple weaker decision trees into a stronger final model. It supports various loss functions specifically designed for regression tasks, which guide the training process and evaluate how well your model performs. The library utilizes histogram-based algorithms for faster data processing and can also handle different types of features within your data, including categorical ones. Overall, EvoTrees provides a versatile toolkit for building regression models in Julia using gradient boosting.

Split the dataset in train and test considering only the hours with solar radiation grater than zero (exclude nights and evenings). We consider train from 01/01 to 06/30 and test from 07/01 to 12/31. We perform hourly estimation and then we group by day of the year. We include all the available variable into the model:

DT_model = DT[DT.Radiation .> 0.0,:]
train, test = (collect(1:1828),collect(1829:3744));
X = DT_model[:,vcat(2:7,14:19)];
y = DT_model[:,:SystemProduction];

We need to load the EvoTreeRegressor algorithm, set the parameters, create the machine and cross validate the model using 5-folds repeating the operation for 5 times

EvoTreeRegressor = MLJ.@load EvoTreeRegressor pkg=EvoTrees verbosity=0;
et_regressor = EvoTreeRegressor(nbins = 32, max_depth = 10, nrounds = 200);

model_glm = et_regressor;
mach_glm = machine(model_glm, X[train,:], y[train]);
fit!(mach_glm, verbosity=0);

# Cross-validation
evaluate!(mach_glm, resampling = CV(nfolds=5, rng=1234), 
          repeats=5, measure = [rmse, rsquared], verbosity=0);
Precompiling EvoTrees

EvoTrees

  1 dependency successfully precompiled in 5 seconds. 105 already precompiled.

4 Results

DT_model.predicts = zeros(size(DT_model,1))        
DT_model[test,:predicts] .= MLJ.predict(mach_glm, X[test,:]);

df = groupby(DT_model, :date_real)
dt = combine(df, ["SystemProduction","predicts"] .=> [sum, sum]; renamecols = true);
sort!(dt,:date_real);
dt = dt[dt.date_real .> Date.(2017,6,30),:]

q2 = plot(dt[:,:date_real],dt[:,:SystemProduction_sum],  title = "Actual vs Predict", label = "Actual")
q2 = plot!(dt[:,:date_real],dt[:,:predicts_sum], mc = :orange, label = "Predict")
q1 = scatter(dt[:,:SystemProduction_sum],dt[:,:predicts_sum], title = "Actual vs Predict", label = :none)
q1 = plot!(collect(0:59000),collect(0:59000), label = :none, mc = :red)

plot(q1, q2, layout=(1,2), size=(750,300))
println("RMSE: ", string.(rmse(dt[:,:SystemProduction_sum],dt[:,:predicts_sum])))
println("MAE: ", string.(mae(dt[:,:SystemProduction_sum],dt[:,:predicts_sum])))
println("R²: ", string.(cor(dt[:,:SystemProduction_sum],dt[:,:predicts_sum]).^2))
RMSE: 4001.6449123866632
MAE: 2863.85133695596
R²: 0.9479576782299722

5 References